Parameterized Complexity of Feature Selection for Categorical Data Clustering
Authors
Abstract
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), and a budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m − ℓ relevant features such that the cost of any optimal k-clustering on these relevant features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ0-distances) between the vectors of the cluster and its center. The clustering cost is the total sum of the costs of its clusters. We use the framework of parameterized complexity to identify how the computational complexity of the problem depends on the parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) · m^{g(k, |Σ|)} · n^2 for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when k and |Σ| are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints. One interesting fact about Constrained Clustering with Outliers is that, besides Feature Selection, it encompasses many other fundamental problems regarding categorical data, such as Robust Clustering and Binary and Boolean Low-rank Matrix Approximation. Thus as a byproduct of our main theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
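The objective described in the abstract can be made concrete with a small brute-force sketch: pick which m − ℓ features to keep, then check whether the optimal k-clustering cost (sum of Hamming distances to the nearest center) on the projected points stays within the budget B. This is an illustration of the problem definition only, not the paper's algorithm; all function names here are invented for the example, and the exhaustive search is feasible only for toy instances.

```python
# Brute-force illustration of the Feature Selection objective on toy
# instances. This is NOT the FPT algorithm from the paper; it just makes
# the cost function and the selection task concrete.
from itertools import combinations, product

def cluster_cost(points, centers):
    # Cost of a clustering: each point pays the Hamming distance
    # (number of differing coordinates) to its nearest center.
    return sum(
        min(sum(a != b for a, b in zip(p, c)) for c in centers)
        for p in points
    )

def best_k_clustering_cost(points, k, sigma):
    # Optimal k-clustering cost, found by trying every k-subset of
    # possible centers over the alphabet sigma (toy sizes only).
    m = len(points[0])
    return min(
        cluster_cost(points, centers)
        for centers in combinations(product(sigma, repeat=m), k)
    )

def feature_selection(points, k, ell, B, sigma):
    # Return a tuple of m - ell feature indices whose optimal
    # k-clustering cost is at most B, or None if no such set exists.
    m = len(points[0])
    for keep in combinations(range(m), m - ell):
        projected = [tuple(p[i] for i in keep) for p in points]
        if best_k_clustering_cost(projected, k, sigma) <= B:
            return keep
    return None

# Dropping the last (noisy) feature leaves two perfect clusters:
pts = [(0, 0, 1), (0, 0, 0), (1, 1, 0), (1, 1, 1)]
print(feature_selection(pts, k=2, ell=1, B=0, sigma=(0, 1)))  # → (0, 1)
```

The brute force enumerates all feature subsets and all center tuples, so it runs in time exponential in m; the paper's point is that the problem can instead be solved in time f(k, B, |Σ|) · m^{g(k, |Σ|)} · n², i.e., with only a polynomial dependence on m and n for fixed k and |Σ|.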
Similar Resources
Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
Central Clustering of Categorical Data with Automated Feature Weighting
The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinformatics. Currently, central clustering of categorical data is a difficult problem due to the lack of a geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. The...
Feature Selection for Clustering
Clustering is an important data mining task. Data mining often concerns large and high-dimensional data, but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high dimensionality, or both. Different features affect clusters differently: some are important for clusters while others may hinder the clustering task. An efficient way of handling it is by selecting ...
Clustering Ensembles for Categorical Data
Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature. In this paper we focus on the design of ensembles for categorical data. Our approach leverages diverse input clusterings discovered in random subspaces. We experimentally demonstrate the efficacy of our technique in combination with the categorical clustering algorithm COOLCAT.
Journal
Journal title: ACM Transactions on Computation Theory
Year: 2023
ISSN: 1942-3454, 1942-3462
DOI: https://doi.org/10.1145/3604797